修剪是卷积神经网络(CNNS)模型压缩的有效技术,但是由于较大的设计空间,很难找到最佳的修剪政策。为了提高修剪的可用性,已经开发了许多自动修剪方法。最近,由于其坚实的理论基础和高采样效率,贝叶斯优化(BO)被认为是一种自动修剪的竞争算法。但是,BO受到维度的诅咒。由于设计空间的尺寸增加,在修剪深CNN时,BO的性能会恶化。我们提出了一种新颖的聚类算法,该算法降低了设计空间的尺寸以加快搜索过程。随后,提出了回滚算法以恢复高维设计空间,以便获得更高的修剪精度。我们验证了有关Resnet,MobilenetV1和MobilenetV2模型的建议方法。实验表明,提出的方法在修剪深CNN而不会增加运行时间时显着提高BO的收敛速率。源代码可在https://github.com/fanhanwei/bocr上获得。
translated by 谷歌翻译
Deep learning-based full-reference image quality assessment (FR-IQA) models typically rely on the feature distance between the reference and distorted images. However, the underlying assumption of these models that the distance in the deep feature domain could quantify the quality degradation does not scientifically align with the invariant texture perception, especially when the images are generated artificially by neural networks. In this paper, we bring a radical shift in inferring the quality with learned features and propose the Deep Image Dependency (DID) based FR-IQA model. The feature dependency facilitates the comparisons of deep learning features in a high-order manner with Brownian distance covariance, which is characterized by the joint distribution of the features from reference and test images, as well as their marginal distributions. This enables the quantification of the feature dependency against nonlinear transformation, which is far beyond the computation of the numerical errors in the feature space. Experiments on image quality prediction, texture image similarity, and geometric invariance validate the superior performance of our proposed measure.
translated by 谷歌翻译
自然图像的统计规律(称为自然场景统计数据)在不引用图像质量评估中起重要作用。但是,人们普遍认为,通常是计算机生成的屏幕内容图像(SCI)不持有此类统计信息。在这里,我们首次尝试学习SCI的统计数据,基于可以有效确定SCI的质量。所提出的方法的基本机制是基于一个狂野的假设,即没有物理上获得的SCI仍然遵守某些可以以学习方式理解的统计数据。我们从经验上表明,在质量评估中可以有效利用统计偏差,并且在不同的环境中进行评估时,提出的方法优越。广泛的实验结果表明,与现有的NR-IQA模型相比,基于深度统计的SCI质量评估(DFSS-IQA)模型可提供有希望的性能,并在跨数据库设置中显示出很高的概括能力。我们的方法的实现可在https://github.com/baoliang93/dfss-iqa上公开获得。
translated by 谷歌翻译
KDD CUP 2022提出了有关空间动态风能数据集的时间序列预测任务,其中要求参与者预测未来一代给定历史上下文因素。评估指标包含RMSE和MAE。本文介绍了团队88VIP的解决方案,该解决方案主要包括两种模型:梯度增强决策树,以记住基本数据模式和复发性神经网络,以捕获深层和潜在的概率过渡。结合这些模型有助于应对风力的波动,以及训练子模型对预测的异质时间尺度(从几分钟到几天)的杰出特性的目标。此外,还详细介绍了功能工程,插补技术和离线评估的设计。拟议的解决方案在第3阶段的总体在线得分为-45.213。
translated by 谷歌翻译
现有的基于深度学习的全参考IQA(FR-IQA)模型通常通过明确比较特征,以确定性的方式预测图像质量,从而衡量图像严重扭曲的图像是多远,相应的功能与参考的空间相对远。图片。本文中,我们从不同的角度看这个问题,并提议从统计分布的角度对知觉空间中的质量降解进行建模。因此,根据深度特征域中的Wasserstein距离来测量质量。更具体地说,根据执行最终质量评分,测量了预训练VGG网络的每个阶段的1Dwasserstein距离。 Deep Wasserstein距离(DEEPWSD)在神经网络的功能上执行的,可以更好地解释由各种扭曲引起的质量污染,并提出了高级质量预测能力。广泛的实验和理论分析表明,在质量预测和优化方面,提出的DEEPWSD的优越性。
translated by 谷歌翻译
在本文中,我们提出了通过特征级伪参考(PR)幻觉提出的无引用(NR)图像质量评估(IQA)方法。提出的质量评估框架基于自然图像统计行为的先前模型,并植根于以下观点,即可以很好地利用具有感知意义的特征来表征视觉质量。本文中,通过以原始参考为监督的相互学习方案学习了扭曲的图像中的PR特征,并通过三重态约束进一步确保PR特征的区分特性。给定质量推断的扭曲图像,特征水平的分离是用可逆神经层进行最终质量预测的,导致PR和相应的失真特征以进行比较。在四个流行的IQA数据库中证明了我们提出的方法的有效性,跨数据库评估的卓越性能也揭示了我们方法的高概括能力。我们的方法的实现可在https://github.com/baoliang93/fpr上公开获得。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
translated by 谷歌翻译
Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and ({even more importantly}) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We have an opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network dubbed as NeuroExplainer, with applications to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximizes the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer led to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
translated by 谷歌翻译